We have pursued a number of projects in Topological Data Analysis. They concern two aspects of
the  subject,  namely  the  “measurement  of  shape”  via  homological  signatures  and  the  compressed 
representation  of  the  shape  of  a  data  set.  Homological  signatures  from  conventional  topology  have 
been  extended  to  the  notion  of  persistent  homology  [18],  which  is  applicable  to  finite  samples  from 
spaces,  perhaps  with  error,  rather  than  only  to  the  complete  spaces  themselves.  The  properties 
of  persistent  homology  are  under  intense  development,  and  part  of  this  project  constitutes  work 
in that direction. We have also developed additional applications of persistent homology to the
study of the evolution of viruses. Compressed representations of point cloud data sets in the form of
simplicial  complexes  are  constructed  using  the  Mapper  methodology  [4],  [15],  and  in  this  project  we 
have  developed  applications  of  the  methodology  as  well  as  additional  theory  describing  the  stability 
of  the  construction. 


1. Zig-zag persistence: Persistent homology is a methodology which has been developed to
infer the topological properties of data sets, properly understood. The translation of point
cloud data, which consists of discrete sets of points with an associated distance function,
into geometric objects can be done in a number of ways, including the Vietoris-Rips complex.
An important feature is that persistent homology tracks the behavior of these complexes
across a range of values of a threshold parameter, and that the actual geometric features
(as opposed to smaller features which represent noise) are those which persist (or are stable)
across a large range of threshold values as the threshold increases. The complexes
grow with the parameter, and therefore there are induced maps between them. There are,
however,  a  number  of  situations  where  one  would  like  to  track  values  and  assess  consistency 
of  a  homological  invariant  over  a  set  of  values  for  which  the  complex  is  not  increasing  with 
a  scale  parameter.  One  example  of  this  is  where  one  selects  a  large  number  of  samples  from 
a  very  large  data  set.  In  this  case,  one  wants  to  study  persistence  of  features  detected  by 
homology  across  a  range  of  samples  rather  than  across  the  values  of  an  increasing  parameter. 
A  second  situation  is  where  one  is  applying  the  so-called  witness  complex  method  [10].  Here 
one  approximates  the  topology  using  a  complex  based  on  a  set  of  landmark  points,  and  to 
assess  the  fidelity  of  the  construction  it  is  useful  to  compare  the  topologies  across  different 
choices of landmarks. A third situation concerns time-varying data, where one studies data
sets for various time slices, which might overlap but where neither slice is contained in the other.
A  solution  to  these  problems  is  provided  by  zig-zag  persistence  [5],  [6].  This  method  has 
been studied in a number of situations during the course of the grant [17], with exploratory
work confirming that it is an effective tool for assessing consistency across
samples, across landmarks, and across different choices of a variance parameter in kernel
density estimators.
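
The role of the threshold parameter described at the start of this item can be illustrated in the simplest case, degree-zero homology (connected components) of the Vietoris-Rips filtration. The following is a minimal sketch, not the software used in this project: the function name `h0_barcode` and the sample points are hypothetical, and the computation uses a standard union-find pass over edges sorted by length. It does not implement zig-zag persistence itself.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Return H0 persistence intervals of the Vietoris-Rips filtration.

    Every point is born at threshold 0. When an edge of length d first
    joins two components, one of them dies at d. One component survives
    for all thresholds and gets death value math.inf.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Edges enter the Rips complex in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # merge the two components
            bars.append((0.0, d))    # one component dies at threshold d
    bars.append((0.0, math.inf))     # the component that never dies
    return bars


# Hypothetical example: two tight clusters that are far apart.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
bars = h0_barcode(points)

# Bars longer than 1.0 are treated here as "persistent" features.
long_bars = [b for b in bars if b[1] - b[0] > 1.0]
print(len(bars), len(long_bars))
```

In this example the short bars reflect the small spacing within each cluster, while the two long bars (one of them infinite) reflect the two clusters themselves. The cutoff of 1.0 is an arbitrary choice made for illustration.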

2. Applications of Mapper: In [15], a method was introduced for representing general data
sets in compressed form as a simplicial complex. The idea is that a simplicial complex is an
ideal representation for a data set: by comparison with a scatterplot representation
one obtains both compression of the structure and additional resolution, and by comparison
with standard clustering methods one also obtains additional resolution. The method
had previously been used to study folding problems in molecular dynamics [2]. We have continued
to demonstrate the value of the technique in several directions, including microarray
studies of breast cancer, the study of fragile-X (an autism-related syndrome), and politics
and sports [12], [14], [11]. In the case of breast cancer, it permitted the identification of
a genomic profile which characterizes a group of patients (roughly 8% of the patients) who
all survived the length of the study. In the case of fragile-X, the finding was a decomposition
of all the patients into two distinct groups, with distinct behaviors. The methodology
makes it simple to color the resulting networks by variables of interest, which makes
the effects of those variables clear and helps in understanding the relationships
among them. The pertinence to data fusion comes from the fact that it is quite natural to
apply  the  methodology  to  the  set  of  “columns”  attached  to  a  data  set,  rather  than  the  rows. 
This  might  mean  the  set  of  genes  or  gene  sets  used  as  coordinates  in  microarray  studies,  or 
it  might  mean  the  collection  of  sensors  used  in  the  study  of  geological  data.  The  study  of 
this  kind  of  “sensor  space”  can  detect  many  important  pieces  of  information  concerning  the 
method of collecting data. For example, one might discover that a very large set of sensors
is highly correlated. When this is the case, those sensors will tend to dominate the geometry of the
data set of rows, and one might want to compensate for this by dividing by the density in
the column space.
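
The construction of a compressed representation described in this item can be sketched in a few lines. The code below is an illustrative simplification and not the implementation of [15]: it builds only the one-dimensional skeleton (a graph), and the function names, the interval cover with a fractional `overlap`, and the fixed-radius single-linkage clustering with parameter `eps` are all assumptions made for the example. Each node is a cluster of points lying over one interval of a filter function, and two nodes are joined when their clusters share a point.

```python
import math
from itertools import combinations


def single_linkage(indices, points, eps):
    """Cluster the given point indices: points within eps are linked."""
    clusters, seen = [], set()
    for start in indices:
        if start in seen:
            continue
        stack, cluster = [start], set()
        seen.add(start)
        while stack:
            i = stack.pop()
            cluster.add(i)
            for j in indices:
                if j not in seen and math.dist(points[i], points[j]) <= eps:
                    seen.add(j)
                    stack.append(j)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_graph(points, filter_fn, n_intervals, overlap, eps):
    """Return (nodes, edges) of a Mapper-style graph.

    The range of filter_fn is covered by n_intervals equal intervals, each
    widened on both sides by overlap * (interval length).
    """
    values = [filter_fn(p) for p in points]
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals
    nodes = []
    for k in range(n_intervals):
        a = lo + k * length - overlap * length
        b = lo + (k + 1) * length + overlap * length
        members = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(single_linkage(members, points, eps))
    # Two nodes are connected when their clusters have a point in common.
    edges = [
        (u, v)
        for u, v in combinations(range(len(nodes)), 2)
        if nodes[u] & nodes[v]
    ]
    return nodes, edges


# Hypothetical example: 40 points on a circle, filtered by x-coordinate.
circle = [
    (math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40))
    for k in range(40)
]
nodes, edges = mapper_graph(
    circle, filter_fn=lambda p: p[0], n_intervals=4, overlap=0.25, eps=0.4
)
print(len(nodes), len(edges))
```

For this circle the graph has six nodes and six edges, each node having exactly two neighbors, so the 40 points are compressed to a single loop. Coloring the network by a variable of interest would amount to averaging that variable over the points in each node.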

3. Texture analysis: An earlier application of topological methods in data analysis was the
application of persistent homology to identify the topological type of the space of high-density,
high-contrast 3x3 image patches in natural images [3]. Work on this award included
an application of that finding to design a method for discriminating textures based on the
topological structure of the space of patches. The method relies on finding, for each high-contrast
patch in the texture patch, the closest point on the Klein bottle, and deriving from
these points a distribution on the Klein bottle. Fourier analysis is then carried out on the Klein bottle
to obtain coordinates which discriminate well between patches from a benchmark data set
of texture patches. One strong point of the method is the fact that the behavior of the coordinates
under rotation of the patch is governed by a simple transformation law involving
translation within the Klein bottle. It also suggests the possibility that understanding the
geometry and topology of frequently occurring patches in other imaging modalities should
yield ways of understanding texture-like situations in those modalities.

4.  Coordinates  in  bar  code  space:  Persistence  bar  codes  have  been  shown  to  be  useful 
for  understanding  the  topology  and  geometry  of  individual  data  sets  in  [3],  [16],  and  [8]. 
However,  they  can  also  be  used  in  situations  where  the  data  points  themselves  are  equipped 
with  geometric  structure.  For  example,  databases  of  chemical  compounds  and  of  images  have 
this  property.  In  this  situation,  by  assigning  barcodes  to  the  data  points,  we  obtain  a  database 
of  barcodes.  Geometric  structures  on  the  collection  of  barcodes  are  in  this  case  important  for 
the  purposes  of  analyzing  the  database,  using,  for  example,  methods  from  machine  learning. 
One  such  structure  is  the  bottleneck  distance  [9],  which  is  a  metric  imposed  on  the  set  of 
all  persistence  barcodes.  One  might  also  attempt  to  find  a  coordinate  system  on  the  set  of 
barcodes, for a more direct analysis. This idea has been explored and implemented during
this project [1]. Specifically, we find that the set of all barcodes, with an equivalence relation
which permits the deletion of any bars of length zero, can be described as a colimit of varieties,
and has a ring of functions A of the form


$$A = \mathbb{R}[\tau_{ij};\ 0 \le i,\ 1 \le j]$$

where $\tau_{ij}$ is the function which assigns to each barcode $\{[x_s, y_s]\}_s$ the sum

$$\sum_s (x_s + y_s)^i \, (y_s - x_s)^j.$$

This  idea  can  now  be  used  to  good  effect  to  carry  out  the  analysis  of  databases  of  chemical 
compounds,  as  has  been  demonstrated  by  other  investigators.  It  also  suggests  a  direction  of 
attack  on  questions  concerning  multidimensional  persistence.  Multidimensional  persistence  is 
known  not  to  possess  a  barcode  description  analogous  to  that  for  single  variable  persistence. 
However,  it  appears  likely  that  this  ring  of  functions  can  be  extended  to  a  ring  of  functions  on 
sets  of  multidimensional  persistence  profiles.  This  would  be  extremely  powerful,  since  it  is  by 
now  clear  to  most  investigators  that  multidimensional  persistence  profiles  are  of  fundamental 
importance in the study of various kinds of data sets. A two-dimensional profile based on
both a scale (or distance) variable and a density variable would be extremely useful, and
this work points in that direction.
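
To make the idea of coordinates on barcode space concrete, the sketch below evaluates a finite family of such functions on a barcode. It assumes that the generators take the form tau_ij = sum over bars [x_s, y_s] of (x_s + y_s)^i (y_s - x_s)^j with i >= 0 and j >= 1; the function names `tau` and `coordinates` and the sample barcodes are hypothetical. The requirement j >= 1 is what makes the values insensitive to bars of length zero.

```python
def tau(i, j, barcode):
    """Evaluate the assumed generator tau_ij on a barcode.

    barcode is a list of (x, y) pairs with x <= y. The assumed form is
    the sum over bars of (x + y)**i * (y - x)**j, with i >= 0 and j >= 1.
    """
    if i < 0 or j < 1:
        raise ValueError("tau_ij is only defined here for i >= 0, j >= 1")
    return sum((x + y) ** i * (y - x) ** j for x, y in barcode)


def coordinates(barcode, max_i, max_j):
    """Return the vector of tau_ij values for 0 <= i <= max_i, 1 <= j <= max_j."""
    return [
        tau(i, j, barcode)
        for i in range(max_i + 1)
        for j in range(1, max_j + 1)
    ]


# Hypothetical barcode with two bars, and a copy with a zero-length bar added.
barcode = [(0.0, 2.0), (1.0, 1.5)]
padded = barcode + [(3.0, 3.0)]

vec = coordinates(barcode, max_i=2, max_j=2)
vec_padded = coordinates(padded, max_i=2, max_j=2)
print(vec)
print(vec == vec_padded)
```

Because each bar contributes a factor (y - x)^j with j >= 1, a bar of length zero contributes nothing, so the two coordinate vectors agree. Fixed-length vectors of this kind can be passed directly to standard machine learning methods.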